import pandas as pd
import numpy as np
from pathlib import Path
import requests
import pandas_datareader as pdr
import yfinance as yf
L07: Pandas I/O
Preliminaries
If you have not done so already (you were asked to do this in lecture01):
- Open a Terminal, type the following command and hit enter:
- pip install yfinance pandas-datareader
- Open a Terminal, type the following command and hit enter:
- conda install -y openpyxl xlrd
Relative vs absolute paths
In this lecture, we will have to tell Python exactly where on our drive it should store some files we create. We can do this by specifying the full path to these locations, for example:
‘C:/Users/ionmi/Dropbox/TEACHING’
The above is an absolute path: it contains the full path to the TEACHING folder on my drive.
An alternative way to specify a path (which we will use very often in this course) is to specify that path relative to the current working directory. To do this, we use a combination of one or more dots (‘.’) and/or slashes (‘/’) which have the following meaning:
‘.’ means the current working directory (in our case, this is the directory where these lecture notes are stored on the drive).
‘..’ means the parent of the current working directory.
‘../..’ means the parent of the parent of the current working directory.
‘../../..’ means the parent of the parent of the parent of the current working directory.
etc.
To see this in practice, we can use the Path class in the pathlib package, which allows us to see the absolute path of a given relative path:
Path('.').resolve() #this will give you a different output than mine
PosixPath('/data/imb/Dropbox/0teaching/UA_2023_Spring_525/lectures/lecture08_pandas_io')
Path('..').resolve()
PosixPath('/data/imb/Dropbox/0teaching/UA_2023_Spring_525/lectures')
Path('../..').resolve()
PosixPath('/data/imb/Dropbox/0teaching/UA_2023_Spring_525')
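As an aside, Path objects can also be combined using the / operator, which makes it easy to build relative paths in code. A small sketch (the data folder and file name here are hypothetical, just for illustration):

(Path('..') / 'data' / 'myfile.csv').resolve() #builds '../data/myfile.csv' and shows its absolute path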
Reading and writing .pkl (pickle) files
Python has its own binary data format called “pickle”. These types of files have the extension “.pkl”. Saving and loading data from pickle files is significantly faster than from/to “.csv”, so we will be using it quite a bit throughout the course.
df = pd.DataFrame(data=np.random.rand(5,4), columns = list('ABCD')) #you'll get different numbers every time you re-run this
df
| | A | B | C | D |
|---|---|---|---|---|
0 | 0.377738 | 0.380756 | 0.822901 | 0.685419 |
1 | 0.134739 | 0.608980 | 0.171573 | 0.064065 |
2 | 0.208184 | 0.155670 | 0.083572 | 0.922087 |
3 | 0.796854 | 0.193076 | 0.262668 | 0.639699 |
4 | 0.927714 | 0.290597 | 0.763463 | 0.390035 |
.to_pickle()
To store this data in a .pkl file, we use the “.to_pickle()” function, applied right after the name of the dataframe which contains the data we want to store.
Syntax:
DataFrame.to_pickle(path, compression='infer', protocol=5, storage_options=None)
Note that the first argument (path) is mandatory (it has no default value). This argument is where you specify the name of the .pkl file you want to create (mydata.pkl below) and the location (directory) where this file should be stored (. below), all in a single string, separated by /.
df.to_pickle('./mydata.pkl')
Note that we can also compress the file:
df.to_pickle('./mydata.zip')
.read_pickle()
To read data from an existing .pkl file, we use the “.read_pickle()” function, specifying as an argument the path to the file we want to read (including its name):
Syntax:
pandas.read_pickle(filepath_or_buffer, compression='infer', storage_options=None)
For example, if we want to read the contents of the .pkl file we just created above, and store those contents into a new variable df2, we would use:
df2 = pd.read_pickle('./mydata.pkl')
df2
| | A | B | C | D |
|---|---|---|---|---|
0 | 0.377738 | 0.380756 | 0.822901 | 0.685419 |
1 | 0.134739 | 0.608980 | 0.171573 | 0.064065 |
2 | 0.208184 | 0.155670 | 0.083572 | 0.922087 |
3 | 0.796854 | 0.193076 | 0.262668 | 0.639699 |
4 | 0.927714 | 0.290597 | 0.763463 | 0.390035 |
And we can read compressed .pkl files too:
df2 = pd.read_pickle('./mydata.zip')
df2
| | A | B | C | D |
|---|---|---|---|---|
0 | 0.377738 | 0.380756 | 0.822901 | 0.685419 |
1 | 0.134739 | 0.608980 | 0.171573 | 0.064065 |
2 | 0.208184 | 0.155670 | 0.083572 | 0.922087 |
3 | 0.796854 | 0.193076 | 0.262668 | 0.639699 |
4 | 0.927714 | 0.290597 | 0.763463 | 0.390035 |
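If you want to verify the speed claim from the start of this section on your own machine, here is a rough benchmark sketch (exact timings will differ from machine to machine; the big.pkl and big.csv file names are just for illustration):

import time
big = pd.DataFrame(np.random.rand(1_000_000, 4), columns = list('ABCD')) #a larger dataframe so the difference is visible
t0 = time.perf_counter()
big.to_pickle('./big.pkl')
t1 = time.perf_counter()
big.to_csv('./big.csv')
t2 = time.perf_counter()
print(f'pickle write: {t1 - t0:.2f}s, csv write: {t2 - t1:.2f}s')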
Note a very important difference in how we use the two functions above. The syntax for .to_pickle() starts with DataFrame.to_pickle, which tells us that the function must be applied to an existing DataFrame. On the other hand, the syntax for .read_pickle() starts with pandas.read_pickle, which we shortened to pd.read_pickle because we imported pandas as pd in the first code cell of this notebook (at the top).
This pattern is the same for all the read-write functions we discuss in this lecture: the write functions (.to_pickle(), .to_csv(), .to_excel()) are written after the name of the dataframe we want to write to a file, while the read functions (.read_pickle(), .read_csv(), .read_excel()) follow the name of the pandas package (which we imported as pd above).
Reading and writing .txt and .csv files
The most common way to read and write dataframes from/to .csv and .txt files is with the Pandas functions “.to_csv()” (for writing) and “.read_csv()” (for reading).
.to_csv()
Here is the abbreviated version of the syntax for .to_csv(), excluding parameters that are not used as often:
DataFrame.to_csv(path_or_buf=None, sep=',', columns=None, header=True, index=True, index_label=None)
Note that the default separator is a comma (sep=','), which means we can omit that parameter when we write .csv files. The columns parameter allows you to specify which columns of the dataframe you want to write to the .csv file.
df.to_csv('./mydata.csv', columns = ['B','C'])
To write tab-delimited .txt files, we use sep='\t' and change the file extension to .txt:
df.to_csv('./mydata.txt', sep='\t')
To write space-delimited .txt files, we use sep=' ' (though I always recommend using tabs for .txt files):
df.to_csv('./mydata_space.txt', sep=' ')
.read_csv()
Here is the abbreviated version of the syntax for .read_csv(), excluding parameters that are not used as often:
pandas.read_csv(filepath_or_buffer, sep=',', header='infer', names=None, index_col=None, usecols=None, nrows=None, skiprows=None)
Note that the default separator is a comma (sep=','), which means we can omit that parameter when we read .csv files:
df3 = pd.read_csv('./mydata.csv')
df3
| | Unnamed: 0 | B | C |
|---|---|---|---|
0 | 0 | 0.380756 | 0.822901 |
1 | 1 | 0.608980 | 0.171573 |
2 | 2 | 0.155670 | 0.083572 |
3 | 3 | 0.193076 | 0.262668 |
4 | 4 | 0.290597 | 0.763463 |
Note, however, that we did not specify that the first column is just an index for the table, so that first column was included as data in the table itself. Note also that .read_csv() guessed that the column names are on the first row, because the default value of the header parameter is 'infer'. To be safe, I always recommend being explicit about where the row names and column names are (remember, Python starts counting from 0):
df3 = pd.read_csv('./mydata.csv', header = 0, index_col = 0)
df3
| | B | C |
|---|---|---|
0 | 0.380756 | 0.822901 |
1 | 0.608980 | 0.171573 |
2 | 0.155670 | 0.083572 |
3 | 0.193076 | 0.262668 |
4 | 0.290597 | 0.763463 |
To read tab-delimited .txt files, we use sep='\t':
df4 = pd.read_csv('./mydata.txt', sep='\t', header = 0, index_col = 0)
df4
| | A | B | C | D |
|---|---|---|---|---|
0 | 0.377738 | 0.380756 | 0.822901 | 0.685419 |
1 | 0.134739 | 0.608980 | 0.171573 | 0.064065 |
2 | 0.208184 | 0.155670 | 0.083572 | 0.922087 |
3 | 0.796854 | 0.193076 | 0.262668 | 0.639699 |
4 | 0.927714 | 0.290597 | 0.763463 | 0.390035 |
To read space-delimited .txt files, we use sep=' ':
df5 = pd.read_csv('./mydata_space.txt', sep=' ', header = 0, index_col = 0)
df5
| | A | B | C | D |
|---|---|---|---|---|
0 | 0.377738 | 0.380756 | 0.822901 | 0.685419 |
1 | 0.134739 | 0.608980 | 0.171573 | 0.064065 |
2 | 0.208184 | 0.155670 | 0.083572 | 0.922087 |
3 | 0.796854 | 0.193076 | 0.262668 | 0.639699 |
4 | 0.927714 | 0.290597 | 0.763463 | 0.390035 |
The .to_csv() and .read_csv() functions have a lot more useful parameters. In the practice problems for this lecture, you will be asked to investigate some of them on your own by reading the official documentation:
- .to_csv(): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
- .read_csv(): https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
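For example, here is a small sketch of two of the less common .read_csv() parameters from the syntax above, usecols and nrows, applied to the mydata.csv file we created earlier (it keeps only column B and reads only the first 3 data rows; df7 is a new variable name used just for this example):

df7 = pd.read_csv('./mydata.csv', usecols = ['B'], nrows = 3)
df7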
Reading and writing .xlsx files
We’ll use .to_excel() to write Excel files and .read_excel() to read Excel files. The biggest difference from the .csv functions is that, with the Excel ones, you can specify a particular sheet of the Excel file that you want to read/write.
.to_excel()
Here is an abbreviated version of the syntax for .to_excel():
DataFrame.to_excel(excel_writer, sheet_name='Sheet1', columns=None)
df.to_excel('./mydata_excel.xlsx', sheet_name = 's1', columns = ['A', 'C'])
.read_excel()
Here is an abbreviated version of the syntax for .read_excel():
pandas.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, skiprows=None, nrows=None)
df6 = pd.read_excel('./mydata_excel.xlsx', sheet_name = 's1', header = 0, index_col = 0, engine = 'openpyxl')
df6
| | A | C |
|---|---|---|
0 | 0.377738 | 0.822901 |
1 | 0.134739 | 0.171573 |
2 | 0.208184 | 0.083572 |
3 | 0.796854 | 0.262668 |
4 | 0.927714 | 0.763463 |
The .to_excel() and .read_excel() functions have a lot more useful parameters. In the practice problems for this lecture, you will be asked to investigate some of them on your own by reading the official documentation:
- .to_excel(): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html
- .read_excel(): https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
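One thing the .csv functions cannot do is store several tables in one file. As a sketch of how this works with Excel files, we can write the same dataframe to two different sheets using pandas' ExcelWriter, and then read all sheets back at once with sheet_name=None, which returns a dictionary of dataframes keyed by sheet name (the two_sheets.xlsx file name and the sheet names are just for illustration):

with pd.ExcelWriter('./two_sheets.xlsx') as writer:
    df.to_excel(writer, sheet_name = 's1')
    df.to_excel(writer, sheet_name = 's2')
all_sheets = pd.read_excel('./two_sheets.xlsx', sheet_name = None, index_col = 0) #dictionary: {'s1': ..., 's2': ...}
all_sheets['s2']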
Some important data-acquisition packages
The “requests” package
The requests package allows us to retrieve data from websites. If you want a more detailed discussion of the full functionality of this package, see the documentation at https://docs.python-requests.org/en/latest/.
Here, we’ll just see how we can use the package to download data from files hosted on websites. For this, we need the URL of the file we want to download. In the example below, I use data on economic policy uncertainty in the US from this website:
https://www.policyuncertainty.com/us_monthly.html
If you right-click on the “Download Data” link and select “Copy Link Address”, you should see the link below when you paste it in your code:
= "https://www.policyuncertainty.com/media/US_Policy_Uncertainty_Data.xlsx" url
We use the .get() function to retrieve the (binary) data from the URL above:
r = requests.get(url)
We can check whether the request was successful using the status_code attribute: 200 means the request succeeded, while codes like 404 (“not found”) indicate an error.
r.status_code
200
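As an alternative to checking status_code manually, requests also provides the .raise_for_status() method, which raises an exception for a failed request and does nothing if the request succeeded:

r.raise_for_status() #no output here, because our request succeeded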
To write the data we retrieved into an Excel file on our computer, we use the Python built-in open() function, specifying the path to the file we want to write the data to ('./policy_uncertainty.xlsx' below) and specifying that we are writing binary data to it ('wb' below):
outfile = open('./policy_uncertainty.xlsx', 'wb')
type(outfile)
_io.BufferedWriter
Now we can write the data into that file using the file object's write() function and then close the file with its close() function. Note that the actual data from the URL above is found under the content attribute of the request r that we created above.
outfile.write(r.content)
outfile.close()
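As an aside, an equivalent (and generally safer) way to perform these two steps is with a with block, which closes the file automatically even if an error occurs while writing:

with open('./policy_uncertainty.xlsx', 'wb') as outfile:
    outfile.write(r.content)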
We can check whether this process worked either by manually opening policy_uncertainty.xlsx in our working directory, or by using pd.read_excel() to read the data into a dataframe and take a look at it:
df = pd.read_excel('./policy_uncertainty.xlsx').dropna() #.dropna() means drop rows with missing values
df
| | Year | Month | Three_Component_Index | News_Based_Policy_Uncert_Index |
|---|---|---|---|---|
0 | 1985 | 1.0 | 125.224740 | 103.748803 |
1 | 1985 | 2.0 | 99.020809 | 78.313193 |
2 | 1985 | 3.0 | 112.190509 | 100.761475 |
3 | 1985 | 4.0 | 102.811319 | 84.778863 |
4 | 1985 | 5.0 | 120.082716 | 98.053653 |
... | ... | ... | ... | ... |
452 | 2022 | 9.0 | 174.183475 | 201.738489 |
453 | 2022 | 10.0 | 177.423127 | 207.275335 |
454 | 2022 | 11.0 | 171.737257 | 210.374894 |
455 | 2022 | 12.0 | 136.502767 | 150.156098 |
456 | 2023 | 1.0 | 143.894044 | 162.788431 |
457 rows × 4 columns
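As a sketch of a typical next step (not required here), we could combine the Year and Month columns into a proper datetime index; one simple way is to build date strings and pass them to pd.to_datetime:

df.index = pd.to_datetime(df['Year'].astype(int).astype(str) + '-' + df['Month'].astype(int).astype(str) + '-01') #e.g. '1985-1-01'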
The “pandas_datareader” package
The pandas_datareader package allows us to download data from many different sources on the internet. Here is a list of all these sources:
https://pandas-datareader.readthedocs.io/en/latest/readers/index.html
The general syntax to download data from a particular source is as follows:
Syntax:
pandas_datareader.DataReader(name, data_source=None, start=None, end=None)
The two sources I will cover here are the St. Louis Federal Reserve Economic Data (FRED) (data_source = 'fred'), which contains a lot of useful macroeconomic data, and the Fama-French data in Ken French's Data Library (data_source = 'famafrench'), which contains returns on many portfolios commonly used in asset pricing (e.g. the market portfolio, SMB, HML, etc.).
For both of these sources, we use the name parameter to specify what exactly we want to download.
For example, to download data on the CPI from FRED, we need to use name = 'CPIAUCSL', which is the internal name that FRED uses for the CPI data:
https://fred.stlouisfed.org/series/CPIAUCSL
cpi = pdr.DataReader(name = 'CPIAUCSL', data_source = 'fred', start = '2020-09-01', end = '2020-12-31')
cpi
| DATE | CPIAUCSL |
|---|---|
2020-09-01 | 260.190 |
2020-10-01 | 260.352 |
2020-11-01 | 260.721 |
2020-12-01 | 261.564 |
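Because CPI levels themselves are rarely the object of interest, here is a quick sketch of how you might turn them into monthly inflation rates using pandas' .pct_change() method:

inflation = cpi['CPIAUCSL'].pct_change().dropna() * 100 #percent change from the previous month
inflation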
To download data on the Fama-French three risk factors (market, SMB, and HML), we use name = 'F-F_Research_Data_Factors', which is the name of the text file containing these factors on Ken French's website:
https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html
ff3f = pdr.DataReader(name = 'F-F_Research_Data_Factors', data_source = 'famafrench', start = '2020-09-01', end = '2020-12-31')
ff3f
{0: Mkt-RF SMB HML RF
Date
2020-09 -3.63 0.04 -2.68 0.01
2020-10 -2.10 4.36 4.21 0.01
2020-11 12.47 5.82 2.14 0.01
2020-12 4.63 4.89 -1.51 0.01,
1: Mkt-RF SMB HML RF
Date
2020 23.66 13.17 -46.57 0.45,
'DESCR': 'F-F Research Data Factors\n-------------------------\n\nThis file was created by CMPT_ME_BEME_RETS using the 202212 CRSP database. The 1-month TBill return is from Ibbotson and Associates, Inc. Copyright 2022 Kenneth R. French\n\n 0 : (4 rows x 4 cols)\n 1 : Annual Factors: January-December (1 rows x 4 cols)'}
Note that for the ‘famafrench’ data source, the DataReader function returns a dictionary of dataframes, not a single pandas dataframe. That’s because the ‘F-F_Research_Data_Factors’ file contains multiple tables. The monthly returns on the Fama-French risk factors are in the first entry in that dictionary (the 0 key), so we can retrieve them like this:
ff3f[0]
| Date | Mkt-RF | SMB | HML | RF |
|---|---|---|---|---|
2020-09 | -3.63 | 0.04 | -2.68 | 0.01 |
2020-10 | -2.10 | 4.36 | 4.21 | 0.01 |
2020-11 | 12.47 | 5.82 | 2.14 | 0.01 |
2020-12 | 4.63 | 4.89 | -1.51 | 0.01 |
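The 'DESCR' key in that same dictionary contains a description of all the tables, which is useful when you do not know what the other keys hold:

print(ff3f['DESCR'])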
There is no easy way to know under what name you can find the data you need. You have to look at the FRED and Fama-French websites first, to see what names those websites use for the data you need, and then type those names into your code, like we did above.
The “yfinance” package
The yfinance package allows us to retrieve stock price data from Yahoo Finance. The full documentation for the package can be found here: https://pypi.org/project/yfinance/ (especially, look under “Fetching data for multiple tickers” on the main page).
Syntax:
yfinance.download(tickers, start = None, end = None, interval = '1d')
For example, to retrieve monthly stock prices for Microsoft and Apple, we need to supply their tickers in a single string (separated by a space) as the first parameter to the download function and change the interval parameter to '1mo' (otherwise it will give us daily data):
= yf.download(tickers = "MSFT AAPL", start = '2020-09-01', end = '2020-12-31', interval = '1mo')
prc prc
[*********************100%***********************] 2 of 2 completed
| Date | Adj Close AAPL | Adj Close MSFT | Close AAPL | Close MSFT | High AAPL | High MSFT | Low AAPL | Low MSFT | Open AAPL | Open MSFT | Volume AAPL | Volume MSFT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
2020-09-01 | 114.239166 | 206.105713 | 115.809998 | 210.330002 | 137.979996 | 232.860001 | 103.099998 | 196.250000 | 132.759995 | 225.509995 | 3.885245e+09 | 768176300.0 |
2020-10-01 | 107.383453 | 198.403580 | 108.860001 | 202.470001 | 125.389999 | 225.210007 | 107.720001 | 199.619995 | 117.639999 | 213.490005 | 2.894666e+09 | 631618000.0 |
2020-11-01 | 117.435226 | 209.770584 | 119.050003 | 214.070007 | 121.989998 | 228.119995 | 107.320000 | 200.119995 | 109.110001 | 204.289993 | 2.123077e+09 | 573443000.0 |
2020-11-06 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2020-11-18 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2020-12-01 | 131.116043 | 218.523499 | 132.690002 | 222.419998 | 138.789993 | 227.179993 | 120.010002 | 209.110001 | 121.010002 | 214.509995 | 2.322190e+09 | 594761700.0 |
We will always drop missing values (with .dropna()) and use the Adj Close prices (prices adjusted for dividends and splits):
aprc = prc['Adj Close'].dropna()
aprc
| Date | AAPL | MSFT |
|---|---|---|
2020-09-01 | 114.239166 | 206.105713 |
2020-10-01 | 107.383453 | 198.403580 |
2020-11-01 | 117.435226 | 209.770584 |
2020-12-01 | 131.116043 | 218.523499 |
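A quick sketch of a typical next step: computing monthly returns from these adjusted prices with .pct_change() (this is not needed for the rest of this lecture):

ret = aprc.pct_change().dropna() #simple returns: (P_t - P_{t-1}) / P_{t-1}
ret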
Note that, if we download data for a single stock, this will return a pandas Series, not a DataFrame:
= yf.download(tickers = "MSFT", start = '2020-09-01', end = '2020-12-31', interval = '1mo')['Adj Close'].dropna()
prc2 prc2
[*********************100%***********************] 1 of 1 completed
Date
2020-09-01 206.105698
2020-10-01 198.403580
2020-11-01 209.770599
2020-12-01 218.523514
Name: Adj Close, dtype: float64
type(prc2)
pandas.core.series.Series
As mentioned before, we will usually turn Series into dataframes before continuing to work with them further:
aprc2 = prc2.to_frame()
aprc2
| Date | Adj Close |
|---|---|
2020-09-01 | 206.105698 |
2020-10-01 | 198.403580 |
2020-11-01 | 209.770599 |
2020-12-01 | 218.523514 |
The “wrds” package
The wrds package allows us to download data directly from the WRDS database. Unfortunately, this functionality is not available for class accounts like the ones I created for this course. I mention this package only for PhD students, who should be able to use it with their own individual WRDS credentials. The documentation for this package is found here:
https://pypi.org/project/wrds/
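For those of you with individual WRDS credentials, here is a minimal sketch of what using the package looks like (the query below assumes your account has access to the CRSP monthly stock file; it is just for illustration):

import wrds
db = wrds.Connection() #will prompt for your WRDS username and password
msf = db.raw_sql('select permno, date, ret from crsp.msf limit 10') #a small sample query
db.close()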